arXiv:1509.07685v3 [physics.comp-ph] 8 Oct 2015 


The Dynamical Kernel Scheduler - Part 1 

Andreas Adelmann^*’*, Uldis Locans®’*’, Andreas Suter^* 

“Paul Scherrer Institut, Villigen, CH-5232, Switzerland 
^University of Latvia, 19 Raina Blvd., Riga, LV1586, Latvia 


Abstract 

Emerging proeessor arehiteetures sueh as GPUs and Intel MICs provide a huge 
performanee potential for high performanee eomputing. However developing 
software that uses these hardware accelerators introduces additional challenges 
for the developer. These challenges may include exposing increased parallelism, 
handling different hardware designs, and using multiple development frameworks 
in order to utilise devices from different vendors. 

The Dynamic Kernel Scheduler (DKS) is being developed in order to provide 
a software layer between the host application and different hardware accelerators. 
DKS handles the communication between the host and the device, schedules task 
execution, and provides a library of built-in algorithms. Algorithms available in 
the DKS library will be written in CUDA, OpenCL, and OpenMP. Depending on 
the available hardware, the DKS can select the appropriate implementation of the 
algorithm. 

The first DKS version was created using CUDA for the Nvidia GPUs and 
OpenMP for Intel MIC. DKS was further integrated into OPAL (Object-oriented 
Parallel Accelerator Library) in order to speed up a parallel LET based Poisson 
solver and Monte Carlo simulations for particle matter interaction used for pro¬ 
ton therapy degrader modelling. DKS was also used together with Minuit2 for 
parameter fitting, where and max-log-likelihood functions were offloaded to 
the hardware accelerator. The concepts of the DKS, first results, and plans for the 
future will be shown in this paper. 
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1. Introduction 


In recent years hardware accelerators have become increasingly popular within 
scientific computing. Based on the TopSOO list from June 2015 [1], 90 of the top 
500 supercomputers in the world are accelerator based. This includes the top two 
systems on the list: Tianhe-2 which uses Intel Xeon Phi coprocessors and Titan 
which uses NVIDIA K20x GPUs. GPU usage for general purpose computing has 
become even more important, due to the gaming industry. Almost every computer 
is now equipped with a GPU, but if the application is not exploiting the GPU, it 
is not using all the available computational power of the system. However, de¬ 
veloping software that can take advantage of hardware accelerators can become 
a challenging task, especially for large existing applications. Each hardware ac¬ 
celerator has its own architecture and memory hierarchy, which must be taken 
into account to gain the maximum performance out of the device. In addition to 
hardware differences, there are also varying methods to program these devices. 
NVIDIA provides the CUDA [2] toolkit for its GPUs, both AMD and NVIDIA 
support the OpenCL [3] framework, and Intel allows usage of standard tools and 
languages to program Intel MIC processor [4] . There are also OpenACC [5] and 
OpenMP [6] standards that allow the targeting of hardware accelerators by ex¬ 
pressing parallelism through compiler directives. 

In this work, the Dynamic Kernel Scheduler (DKS) is presented which pro¬ 
vides a slim software layer between the host application and the hardware accel¬ 
erators. DKS separates the accelerator and framework specific code from the host 
application and provides a simple interface that can be implemented in the host 
application to offload tasks to the accelerator. DKS provides functions to handle 
communication and data transfer between host and device, as well as a library 
of functions written in CUDA, OpenCL, and OpenMP that allow the targeting of 
different accelerators. The first version of DKS was integrated into OPAL (Object- 
oriented Parallel Accelerator Library). This DKS version uses CUDA kernels and 
OpenMP offload pragmas to run OPAL’s LPT based Poisson solver and Monte 
Carlo simulations on a GPU and Intel MIC. DKS was also used together with 
Minuit2 for parameter fitting, where and max-log-likelihood functions were 
offloaded to the hardware accelerator. 

In the literature, there are several LPT Based Poisson solvers developed for 
GPUs using CUDA which use NVIDIA’s cuPPT library [7, 8]. One can also find 
research on the use of customized PPTs for asynchronous execution and mapping 
LPT based Poisson solvers to multi node systems [9, 10, 11]. Numerous studies 
[12, 13, 14, 15, 16, 17] have been carried out to show the potential of GPUs and 
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Intel Xeon Phi co-processors for Monte-Carlo simulations for proton and photon 
transport. These problems are some of the most time consuming parts of the 
OPAL simulations, and previous research shows that they are good candidates for 
acceleration on the co-processors. 

Many research projects try to focus on improving programmability of hard¬ 
ware accelerators. Apart from compiler directive based approaches, there are a 
number of vendor supported libraries [18, 19] that allow the simplification of 
offloading specific tasks to accelerators. There has also been work on creat¬ 
ing abstractions and providing software layers that would allow to express ker¬ 
nels [20, 21, 22] for hardware accelerators, which can be translated to CUDA or 
OpenCL code that is run on the device. 

The ability of DKS to have implementations using different frameworks and 
libraries, and switch between them from the host application allows the targeting 
of hardware accelerators of different types and fine tuning of the code to gain 
the maximum performance from each device. This approach also provides moro 
portability and software investment protection for the host application. In case 
some hardware architecture is no longer manufactured, or some new architecture 
or development framework emerges, only DKS needs to be updated. 

The rest of the paper is structured as follows: Section 2 describes the concepts 
and architecture of DKS; Section 3 describes the concepts of OPAL’s FFT based 
Poisson solver and Monte-Carlo type particle matter interaction simulations as 
well as DKS implementation of these functions and the benchmark results; Sec¬ 
tion 4 explains the DKS and Minuit2 usage and results for parameter fitting 
using hardware acceleration; and Section 5 provides conclusions and future of the 
DKS. 

2. Concept and Architecture of the Dynamic Kernel Scheduler (DKS) 

2.1. Concept 

The Dynamic Kernel Scheduler (DKS) is a slim software layer between the 
host application and the hardware accelerator, as depicted in Figure 1 . The aim 
of the DKS is to allow the creation of fast fine tuned kernels using device specific 
frameworks such as CUDA, OpenCL, OpenACC and OpenMP. On top of that, 
DKS allows the easy use of these kernels in host applications without providing 
any device or framework specific details. This approach facilitates the integra¬ 
tion of different types of devices in the existing applications with minimal code 
changes and makes the device and the host code a lot more manageable. 

The architecture of DKS can be split in three main parts: 
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Application code 




Figure 1: The Dynamic Kernel Scheduler concept 


1. The first part provides communication functions that handle memory alloca¬ 
tion and data transfer to, and from, the device. All the memory management 
is left up to the user. So that the data transfers and memory allocation can 
be scheduled only when necessary. DKS also supports GPU streams such 
that asynchronous data transfer and kernel execution can be implemented 
when possible. 

2. The second part of DKS consists of a function library, which contains al¬ 
gorithms written in CUDA, OpenCL, and OpenMP to target different de¬ 
vices. DKS can switch between implementations based on the hardware 
that is available. Writing functions using multiple frameworks results in 
extra work, but this provides the opportunity to fine tune kernels for each 
device architecture for maximum performance. That also allows the tar¬ 
geting of systems containing different types of devices. The different im¬ 
plementations of the code are always separated so the code is still easy to 
manage. Additionally if a host application is targeted at a specific system, 
implementations that are not needed can be omitted. 

3. The third part of DKS is the auto-tuning functionality which will be dis¬ 
cussed in a forthcoming paper. The aim of auto-tuning is to select the ap¬ 
propriate implementation of the algorithm and change the launch param¬ 
eters according to the devices that are available on the system in order to 
gain the maximum performance. The auto-tuning functionality relies on 
knowledge of device architecture and benchmark tests that can be run on 
the system before running the application. 


4 
























//allocate memory on device and write data 
void *mem_ptr; 

mem_ptr = dks.allocateMemory<Complex_t>(DATA_SIZE, NULL); 
dks.writeData<Complex_t>{mem_ptr, DATA_ARRAY, DATA_SIZE); 

//execute FFT or IFFT 
if (direction == 1) 

dks.callFFT{mem_ptr, DIMENSIONS, DIM_SIZE); 
else 

dks.calllFFT(mem_ptr, DIMENSIONS, DIM_SIZE); 

//read data and free memory 

dks.readData<Complex_t>{mem_ptr, DATA_ARRAY, DATA_SIZE); 
dks.freeMemory<Complex_t>(mem_ptr, DATA_SIZE); 


Figure 2: Example of DKS usage for FFT 

Figure 2 shows an example eode of DKS usage inside a host applieation to 
perform Fast Fourier transform. The host applieation has full eontrol over the 
memory alloeation and data transfer to the deviee, but there are no deviee speeifie 
details in the host eode. DKS evaluates the ealls made by host applieation and 
ehooses the appropriate deviee to use, and algorithm implementation, to run the 
eode on seleeted aeeelerator. 

2.2. Architecture 

The Dynamie Kernel Seheduler is split into separate modules. Eaeh module 
eontains funetion implementations using different frameworks. The base elass for 
eaeh module eontains funetions whieh handles the deviee management, memory 
management, and data transfer, this base elass ean be extended to eover all the 
neeessary algorithm speeifie funetions. The base elass of DKS reeeives all the 
ealls from the host applieation and deeides whieh deviee speeifie implementation 
should be used to run the eode on the deviee. Figure 3 shows the arehiteeture of 
the first version of DKS, for eaeh module base elass ean be easily extended to 
inelude other algorithms and the base elass of DKS ean be extended to inelude 
other modules to handle different development frameworks. 

3. The Dynamic Kernel Scheduler used in Object-oriented Parallel Acceler¬ 
ator Library (OPAL) 

OPAL (Object Oriented Particle Accelerator Library) is a parallel, open source 
C++ framework for general particle accelerator simulations which includes 3D 
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Figure 3: Architecture of the Dynamic Kernel Scheduler. The OpenCL implemen¬ 
tation and the auto tuning framework are both shown in red and will be discussed 
in a subsequent publication. 


space charge, short range wake fields, and particle matter interaction. OPAL is 
based on IPPL (Independent Parallel Particle Layer) which adds parallel capabil¬ 
ities. Main functions inherited from IPPL are structured rectangular grids, fields, 
parallel FFT, and particles with the respective interpolation operators. Other fea¬ 
tures are expression templates, and massive parallelism (up to 65000 processors) 
which allows it to tackle the largest problems in the field. 

3.1. FFT Based Particle-Mesh Solver 

The Particle-Mesh (PM) solver solver is best described in the book by R.W. Hock¬ 
ney & J.W. Eastwood [23]. Instead of calculation the mutual interaction of a large 
number of particles, in the PM solver one discretises the computational domain 
n := [—Ox, Ox] X [—tty, tty] X [—ttz, ttz] mto a regular mesh of Mx x My x Mz grid 
points. The beam sizes Ox, Oy, Oz are usually time dependent. Other geometries 
are possible but not discussed here. The mesh sizes hx^ hy, and hz are allowed 
to change independently over time to assure a particle fitted grid. An essential 
part of any self-consistent electrostatic beam dynamics code is the Poisson solver. 
From the - time to solution - point of view, we observe that in the order of 1/3 of 
the computational time is spend in this algorithm. 

In many of the physics application, the bunch can be considered as small com- 
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Figure 4: The OPAL software structure and connection to DKS 


pared to the transverse size of the surrounding beam pipe (dfl). If this is the case 
the conducting walls can be neglected and, we can solve an open boundary prob¬ 
lem. Here we follow the method of Hockney and compute the potential on a grid 
of size 2'^M^MyMz, assuming 3 spatial dimensions of the problem. The calcu¬ 
lation then is making use of Fast Fourier Transform (FFT) techniques, with a 
computational effort scaling as 0{2^MxMyMzi\og2 2MxMyMz)^) [23, 24, 25]. 

3.1.1. FFT-based Convolutions 

Given a charge density p, we search or the scalar potential 0 by solving Pois¬ 
son’s equation 

VV = -p/£o, 

subject to 0 = 0 at 00 —)■ cx), i.e. in an unbound domain. If we know the Green’s 
function G{x, x', y, y', z, z'), then the solution 


(t){x,yG) = 



dx'dy'dz'p{x', y', z')G{x, x', y, y\ z, z') 


is the convolution of the a source charge at (x', y', z') and G. In our case of an 
isolated charge distribution, we get 


(f>{x,y,z) = 



dx' dy' dz' p{x', y', z')G{x — x\y — y\ z — z'), (1) 


with 


G(m, n, w) 


1 

\/V? + v'^ + w'^ 
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We now discretise Eq. (1) on the previous mentioned Cartesian grid 

My 

4^1,j,k hxhyhz ^ ^ ^ ^ ^ ^ Pi',j'jk'Gi—i'j—ji. ( 2 ) 

i'=l j'=l k'=l 

The two sealar fields pij^k and Gi_i'j_j'^k-k' are now defined on the grid and we 
effieiently obtain the solution of Eq. (2) using Eourier teehniques by 

= hxhyh^ ET'^{(ET{pij-fc}) ® (ET{Gij-J)}, 

with 0 denoting the Hadamard produet. The notation ET{.} for the forward EET 
and ET“^{.} for the inverse EET is used. 

3.1.2. The DKS Implementation of the Poisson Solver 

Eor use on NVIDIA GPUs the EET Poisson solver is implemented in DKS 
using CUDA. It uses NVIDIA’s euEET library to perform the EET, separate ker¬ 
nels to ealeulate the Greens funetion and, perform the multiplieation on the GPU. 
CUDA streams are used to overlap the transfer of the p field to the GPU and the 
ealeulation of the Greens funetion. The sequenee diagram in Eigure 5 shows the 
steps exeeuted for the EET Poisson solver on the host and GPU. In ease multiple 
host eores are sharing a GPU deviee, CUDA inter-proeess eommunieations are 
used to share the deviee memory between the multiple MPI proeesses on the same 
node. Eor the EET Poisson solver, one of the MPI proeesses aets as a main proeess 
and initialises memory alloeation and kernel exeeution on the deviee. The other 
MPI proeesses meanwhile only send and reeeive data to and from the GPU. 


Host 


Stream 1 


Stream 2 



Eigure 5: EET-Poisson solver sequenee diagram 


3.1.3. Performance Results 

To test OPAE’s performanee, we use a similar problem setup as reported in 
[26, 27]. The test system eonsists of a host with two Intel Xeon e5-2609 v2 
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processors and a Nvidia Tesla K20 or Tesla K40. On the host, 8 CPU eores are 
available. The first simulations where run using only the CPUs available on the 
host. However in the seeond ease, DKS is used to offload the FFT Poisson solver 
to the GPU. 


Table 1: FFT Poisson Solver results 


FFT size 

DKS 

Total time (s) 

OPAL 

speedup 

FFTPoisson 
time (s) 

FFTPoisson 

speedup 

64x64x32 

no 

K20 

324.98 

311.17 

xl.04 

22.53 

7.42 

x3 


K40 

293.7 

xl.lO 

7.32 

x3 

128x128x64 

no 

K20 

434.22 

262.74 

xl.6 

206.73 

32.15 

x6.5 


K40 

245.08 

xl.8 

25.87 

x8 

256x256x128 

no 

K20 

2308.05 

625.37 

x3.6 

1879.84 

202.63 

x9.3 


K40 

542.73 

x4.2 

160.87 

xll.7 

512x512x256 

no 

K40 

3760.46 

716.86 

x5.2 

3327.14 

302.49 

xll 


Table 1 shows the results of these test runs for multiple problem sizes. The 
results show that offloading FFT Poisson solver to GPU ean provide a substantial 
speedup even when we have multiple CPU eores sharing one aeeelerator. The 
limiting faetor for the performanee of the FFT Poisson solver is the data transfer 
from the host side to the deviee. Sinee data needs to be moved to and from GPU at 
every time step, for the largest problem size reported in the benehmark tests, data 
transfer ean take up to 55% of the total simulation time. Another limiting faetor 
is the performanee of the FFT transform. FFT is a memory bound algorithm and 
is able to reaeh only a fraetion (about 10% was observed on our test system) of 
the deviees peak performanee. Sinee for the Poisson solver time to perform FFT 
takes up to 80% of all time spent for ealeulations, the speed of the solver depends 
severely on the speed of the FFT. 

3.2. Particle Matter Interaction 

One of the features in OPAL is the ability to perform Monte Carlo simula¬ 
tions of the partiele beam interaetion with matter. A fast eharged partiele moving 
through the material undergoes eollisions with the atomie eleetrons and loses en¬ 
ergy. In addition, partieles are also defleeted from their original trajeetory due to 
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Figure 6: Particle matter interaction. With the final energy E < Eq and larger 
momenta spread due to Coulomb scattering and the large angle Rutherford scat¬ 
tering. 


the Coulomb scattering with nuclei, as shown in figure 6. The energy loss in OPAL 
is calculated using Bethe-Bloch formula and the change of particle trajectory is 
simulated using Multiple Coulomb Scattering and Single Rutherford Scattering 
[28, 29, 30]. At every time step when the particle beam is inside a material, the 
following steps are executed: 

• calculate the energy loss of the beam, 

• delete the particle if the particle’s kinetic energy is smaller than a given 
threshold, 

• apply Coulomb & Rutherford scattering to the beam. 


3.2.1. The Energy Loss 

The energy loss is calculated using the following Bethe-Bloch equation: 


— dE/dx 


Kz^Z 

A(3^ 



2meC^(3‘^'y^T max 
P 



(3) 


where Z is the atomic number of absorber, A is the atomic mass of absorber, me is 
the electron mass, and 2 ; is the charge number of the incident particle. K is defined 
as AnNArlmeP, where rg is the classical electron radius, Na is the Avogadro’s 
number, and / is the mean excitation energy. (3 and 7 are the kinematic variables. 
















Lastly Tmax is the maximum kinetic energy that can be imparted to a free electron 
in a single collision. It is defined as, 


max - 27 me/M + {mjMY ’ 


(4) 


where M is the incident particle mass. 

For relatively thick absorbers, the number of collisions is large, and therefore 
the energy loss distribution is Gaussian in form. For non-relativistic heavy parti¬ 
cles, the spread, cxo, of the Gaussian distribution is calculated by: 

al = ATTNArlmlc^p^As, (5) 

where p is the density and s is the thickness of the material. 


3.2.2. Coulomb Scattering 

The Coulomb scattering is treated as two independent events: the multiple 
Coulomb scattering and the large angle Rutherford scattering. 

Using the distribution given in [31], the multiple- and single-scattering distri¬ 
butions can be written as: 


1 _ 2 

PM{o)da = —;=c " da, (6) 

V 7r 

n / \ 7 1 da 

“ 81n(204Z-V3)^’ 

where a = 10^172 = The transition point between multi and single 

scattering occurs at the angle 6 = 2.5a/26'o ~ 3.500, where value of Oq is the 
scattering angle from Moliere’s theory and is defined as, 

00 = — ~Ycp —^ 3 / '^'S/^o[l + 0.038 hi(As/Xo)], (8) 

where p is the momentum. As is the step size, and Xq is the radiation length. 

To perform a Monte Carlo simulation for the multiple Coulomb scattering two 
independent Gaussian random variables (zi and Z 2 ) are created with mean zero 
and variance one. The new position and momentum can then be calculated by: 

X = X-f Aspa;-f ziAs0o/\/l2-f Z2 As0o/2, (9) 

Px = Px + Z20O- (10) 

The values for the y — Py plane are calculated with the very same Monte-Carlo 
algorithm. 
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3.2.3. Large Angle Rutherford Scattering 

Only a small percentage of particles undergo large angle Rutherford scattering. 
This percentage is given by: 


Xsingle < „ 2.5 


f^^Fs(a)da 


TD. 




) 


= 0.0047. 


( 11 ) 


The process to define if a particle undergoes a Rutherford scatter is as follows: 


• A random number between 0 and 1 is generated. If and only if this ran¬ 
dom number is smaller than Xsingie the particle undergoes single Rutherford 
scattering. The value of Xsingie does not change significantly for different 
materials, hence a fixed value of Xsingie = 0.0047 is used, in order to avoid 
unnecessary computation. 


• A second random variable .^2 between 0 and 1 is generated to calculate the 
angle, the particle rotates about. 


• The third and last random number ^3 determines the direction of the rota¬ 
tion: 


[+2.5yjv^0o if 6 <0.5 

6ru = \ ^! — ( 12 ) 

[-2.5^v^0o if e3>0.5. 

3.2.4. The DKS Implementation of the Particle Matter Interaction Model 

For particle matter interactions, DKS has CUDA and OpenMP implementa¬ 
tions of all the algorithm steps described above. This allows the computation of 
the energy loss, the Coulomb scattering, and the Rutherford scattering to be of¬ 
floaded to the GPU or Intel MIC. On top of particle matter interaction, DKS is 
also able to offload to the accelerators the transport of particles before and af¬ 
ter the material using a time integration scheme. The sequence diagram for the 
integration is shown in Figure 7. 

To increase the performance, the data transfer is minimised as much as pos¬ 
sible and particles that are drifting before or after the degrader are kept on the 
device. They are updated only when there are a some particles returning from the 
material or there has been an MPI update to balance the workload between MPI 
processes. Pinned host memory and streams are used with the GPU version to in¬ 
crease the date transfer speed, and overlap the data transfer and kernel execution 
for the particle drift. 
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Figure 7: Integration sequence diagrams 


Particles that are in the material are also kept in the device memory. NVIDIAs 
cuRAND and Intels MKL VSL libraries are used to generate random numbers to 
determine the necessary distributions for energy loss and scattering. NVIDIA’s 
Thrust library is used to sort and count the particles on the GPU in order to man¬ 
age the particles that need to come out of the material, but also to exclude the 
dead particles from Monte-Carlo simulations. Because of the high complexity of 
the algorithm, the CUDA version uses shared device memory for variable storage 
to reduce the register pressure of the kernels in order to achieve higher GPU oc¬ 
cupancy. Structure of arrays data layout is used to store all the particles in order 
to allow Intel compiler to better vectorise the code for the Xeon Phi coprocessor. 
The sequence diagram of the degrader simulations on the CPU and the device is 
shown in Figure 8. 



Figure 8: Degrader sequence diagrams, where B denotes particle bunch 


3.2.5. Performance Results 

To test the OPAL and DKS Monte Carlo simulations, an example was run 
where particles are shot through a L = 1 cm thick graphite slab. This mimics a 
degrader device used in proton therapy. Timings where obtained for the integra¬ 
tion of the equation of motion, before and after the material, as well as when the 
particles are moving through the material. The system setup is the same as was 
used for the FFT Based Particle-Mesh (PM) Solver benchmark, but, because of 
limitations of OPAL, simulations where run using only one CPU core. 
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Table 2 shows the benehmark results for the degrader and the integration using 
various number of partieles. The speedup of the partiele transport through the 
material is xl40 to xl70 eompared to the one eore of the host proeessor, while 
the integration is able to aehieve a speedup of around x 20 to x 30 on the GPU. 
The Intel MIC on the other hand shows a speedup of x40 for the degrader and x 7 
for the integration eompared to the host. 

Table 2: OPAL degrader results 


Particles 

DKS 

Degrader time (s) 

Degrader 

speedup 

Integration 
time (s) 

Integration 

speedup 


no 

20.30 


3.46 


10^ 

MIC 

2.29 

x8 

0.89 

x4 

K20 

0.28 

x72 

0.15 

x23 


K40 

0.19 

xl07 

0.14 

x24 


no 

206.77 


34.93 


10^ 

MIC 

5.38 

x38 

4.62 

x7.5 

K20 

1.41 

xl46 

1.83 

xl9 


K40 

1.18 

xl75 

1.21 

x29 


no 

2048.25 


351.64 


10^ 

K20 

14.4 

xl42 

17.21 

x20 


K40 

12.79 

xl60 

11.43 

x30 


The limiting faetor for GPU/MIC performanee for the integration is the data 
movement. This operation requires data to be sent to the deviee and reeeived from 
the deviee, at every time step. In the GPU ease kernel exeeution ean be eom- 
pletely overlapped with the data movement, and thus only limited by the memory 
bandwidth of our deviee. 

The limiting faetor for partiele movement through the material is global mem¬ 
ory aeeess times. Eaeh exeeution of the kernel requires a load of position and 
momentum veetors as well as the state of the random number generator for the 
thread. When the kernel finishes position, momentum, random number state and 
possibly partiele state needs to be written baek to global memory. If there are any 
dead partieles, or partieles that are eoming out of the material, these partieles need 
to be removed from the buneh on the aeeelerator. This requires a sorting of the 
partieles whieh also requires a lot of memory movement on the deviee and limits 
the performanee. 
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4. Parameter Fitting with Minuit2 

Minuit2 is a C++ library allowing a multi-parameter minimisation of a user- 
defined funetion [32]. It is a re-implementation of the FORTRAN library MlNUlT 
[33], a very popular minimisation paekage used by high energy physicists. In 
addition to minimisation algorithms, it contains methods for analysing the solu¬ 
tions and can estimate the parameter error correlation matrix. These combined 
capabilities are very difficult to find in other existing minimisers. Its drawback 
for inexperienced users is that the user-defined function needs to be implemented, 
compiled, and linked. This is a common practice in high energy physics, but is 
less common in the solid state physics community. Therefore, for the /rSR com¬ 
munity, the Musrfit framework [34] has been developed. This framework eases 
the analysis of muon spin rotation, relaxation, and resonance (/rSR) experiments 
by allowing user to define all the relevant input parameters and functions for Ml- 
NUIt2 in a scripting manner. We will describe the problem for the specific needs 
of /iSR, however the problem and the described solution is much more generic. 

4.1. Problem Description 

In a time differential /iSR experiment [35], ~ 100% polarised positive muons 
(/i+) are implanted in a solid sample and rapidly thermalise (~ 10 ps) without 
noticeable polarisation loss. The spin evolution of the muon ensemble after im¬ 
plantation is then measured as a function of time. The evolution can be monitored 
by using the fact that the parity violating muon decay is highly anisotropic. Dur¬ 
ing the decay an easily detectable positron is emitted preferentially in the direction 
of the jjP' spin. It takes the form 


N^{t) = (13) 

where the time is measured in discrete steps f = n ■ At [n G Nq, At the time 
resolution], j indexes the positron detectors and the function A-^(p,f) describes 
the “physics” of the system under consideration. For details about the function 
A-^ (p, t), the reader is referred to Ref. [35]. The muon lifetime is given by and 
Nq gives the scale of the positron count. Lastly the constant originates from 
uncorrelated background events. For a given positron histogram, j, the optimal 
parameter set 


( 1 + 

needs to be determined. Depending on the level of statistics of the positron his¬ 
tograms, P is determined with a minimisation: 
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( 15 ) 


x^(p) = EE 

j n 





where are the measured data points of positron deteetor. The theory de- 
seribing the data is given by Eq. (13), and d^ err is the estimated error of 

(di gj,j. = for the Poisson distributed positron histogram). 

For data sets with rather limited statisties, Eq. (15) is not leading to satisfaetory 
results. In this ease the log-likelihood funetion 


^ ^ 2 ^ ^ I [iVf (n ■ At, P) - di] + di log 

j n \ [N^{n ■ At, P) - d{] 4 < 0 

(16) 

should be maximised and lead to mueh better estimates of P. 

In reeent years, ehanges in the deteetor teehnology allow higher time resolu¬ 
tion (smaller deteetable At) at the expense of higher deteetor fragmentation (more 
positron eounters). This is leading to mueh larger data sets, and as a eonsequenee 
the minimisation/maximisation times are exploding. This is espeeially true if a 
parameter error estimate is needed that goes beyond the simple Hessian approaeh. 

The Minuit2 library is almost perfeetly suited to taekle the problem. The 
user needs to implement Eqs. (15) and/or (16) on the aeeelerator(s), whereas the 
minimisation proeess is exeeuted on the host. The main, and most time eonsum- 
ing, part, in the ealeulation is given by Eqs. (15) and (16). The only data transfer 
needed between the host and the aeeelerator is given by the small parameter set 
P, whieh should not lead to a bottleneek in the overall eomputation time. 

4.1.1. The DKS Implementation 

For parameter fitting with Minuit2, DKS is used to offload the max-log- 
likelihood, and the user defined funetion ealeulations to the GPU. The ealeulated 
value is passed to Minuit2 whieh varies the parameter set and returns the new 
parameters. Data transfer from the host to the deviee is minimal. Measurement 
data are transferred to the GPU only at the beginning of the ealeulations. Only 
the small parameter array is transferred to the deviee before every ealeulation and 
only the or max-log-likelihood value is returned by the deviee. CUDA and 
OpenCE are implemented to support various hardware aeeelerators. 


di 


N\n-At, P) 


4 > 0 
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4.1.2. Performance Results 

Parameter fitting tests with Minuit2 and DKS were performed on the same 
maehine as OPAL tests. The host eode used 8 CPU eores and parallelisation was 
done using OpenMP The deviee eode was run on a Nvidia Tesla K20 and Tesla 
K40 using CUDA. Results for offloading and max-log-likelihood funetions 
are shown in the Table 3. Sinee there is almost no data transfer involved in the 
program and the algorithm is easy to parallelise it is a good eandidate for GPU 
acceleration. The time to solution on the GPU is around 300 to 400 times faster 
for and around 150 to 200 times faster for max-log-likelihood functions than 
currently used OpenMP implementation. From the results presented in Tab. 3, 
A^{p, t) was chosen as 


A^{p,t) = Aiexp 



cos(7^5f 4- (f>), 


(17) 


with j = 1 ... 16, Aq are the asymmetries of each positron detector, a the depo¬ 
larisation rate of the muon spin ensemble, 7 ^ the gyromagnetic ratio of the muon, 
B the magnetic induction at the muon stopping site, t the time [see Eq.(13)], and 
(jf the phase of the initial muon spin in respect to the positron detector. Eq.(17) 
is a typical muon polarisation function to determine the magnetic shift of a para- 
/diamagnetic material (see Ref. [35]). Eor the given number of positron detectors, 
66 fitting parameters needed to be determined [see Eqs.(13), (17)]. 


5. Conclusion and Outlook 

In this paper we presented the first version of the Dynamic Kernel Scheduler 
which provides a software layer between the host application and the hardware 
accelerators. This creates a fine tuned code for different hardware accelerators 
using different frameworks and allows to easily integrate it into existing host ap¬ 
plications. DKS was integrated into OPAE to offload EET based Poisson solver 
and Monte-Carlo simulations for particle-matter interactions to a GPU and a Intel 
MIC using either CUDA or OpenMP. DKS was also used together with Minuit2 
for parameter fitting, where x^ and max-log-likelihood function calculations were 
offloaded to a GPU using CUDA or OpenCE. The result of this work shows that 
DKS can be used to substantially speed up existing host applications with mini¬ 
mal additions and changes to host code. Separating the device specific code in a 
different layer allows managing and fine tuning of the code more easily and keeps 
the host application significantly more portable since all the device and framework 
specific details are handled by DKS. 
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Table 3: Minuit2 parameter fitting with and max-log-likelihood (MLE) fune- 
tion running on the GPU. The given time is for the exeeution of the mi grad 
eommand of Minuit2 [32]. 


Data Set Size 

DKS 

(s) 

Speedup 

MLE (s) 

Speedup 

~ 1,300,000 

no 

K20 

K40 

157.077 

0.55012 

0.432707 

x285 

x363 

446.444 

2.75018 

2.31758 

xl62 

xl93 

~ 1,700,000 

no 

K20 

K40 

264.279 

0.863937 

0.675295 

x306 

x391 

664.893 

4.16143 

3.48775 

xl60 

xl90 

~2,200,000 

no 

K20 

K40 

392.727 

1.32307 

1.02269 

x296 

x384 

741.114 

5.2768 

4.41817 

xl40 

xl68 

~3,300,000 

no 

K20 

K40 

859.339 

2.52918 

1.92734 

x339 

x446 

1101.62 

7.60934 

6.30366 

xl44 

xl75 
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